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Abstract 

Quickly moving to a new area of research 
is painful for researchers due to the vast 
amount of scientific Uterature in each field 
of study. One possible way to overcome 
this problem is to summarize a scientific 
topic. In this paper, we propose a model of 
summarizing a single article, which can be 
further used to summarize an entire topic. 
Our model is based on analyzing others' 
viewpoint of the target article's contribu- 
tions and the study of its citation summary 
network using a clustering approach. 

1 Introduction 

It is quite common for researchers to have to 
quickly move into a new area of research. For 
instance, someone trained in text generation may 
want to learn about parsing and someone who 
knows summarization well, may need to learn 
about question answering. In our work, we try to 
make this transition as painless as possible by au- 
tomatically generating summaries of an entire re- 
search topic. This enables a researcher to find the 
chronological order and the progress in that par- 
ticular field of study. An ideal such system will re- 
ceive a topic of research, as the user query, and will 
return a summary of related work on that topic. In 
this paper, we take the first step toward building 
such a system. 

Studies have shown that different citations to the 
same article often focus on different aspects of that 
article, while none alone may cover a full descrip- 
tion of its entire contributions. Hence, the set of 
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citation summaries, can be a good resource to un- 
derstand the main contributions of a paper and how 
that paper affects others. The citation summary of 
an article (A), as defined in ( ,Elkiss et al., 2008| l, 
is a the set of citing sentences pointing to that ar- 
ticle. Thus, this source contains information about 
A from others' point of view. Part of a sample ci- 
tation summary is as follows: 

In the context of DPs, this edge based factorization method 
was proposed by (Eisner, 1996). 

Eisner (1996) gave a generative model with a cubic parsing 
algorithm based on an edge factorization of trees. 

Eisner (Eisner, 1996) proposed an 

0(n'^) parsing algorithm for PDG. 

If the parse has to be projective, Eisner's 
bottom-up-span algorithm (Eisner, 1996) can be 
used for the search. 

The problem of summarizing a whole scientific 
topic, in its simpler form, may reduce to summa- 
rizing one particular article. A citation summary 
can be a good resource to make a summary of a 
target paper Then using each paper's summary 
and some knowledge of the citation network, we'll 
be able to generate a summary of an entire topic. 
Analyzing citation networks is an important com- 
ponent of this goal, and has been widely studied 



before (Newman, 2001 



Our main contribution in this paper is to use ci- 
tation summaries and network analysis techniques 
to produce a summary of a single scientific article 
as a framework for future research on topic sum- 
marization. Given that the citation summary of any 
article usually has more than a few sentences, the 
main challenge of this task is to find a subset of 
these sentences that wiU lead to a better and shorter 
summary. 



Cluster 


Nodes 


Edges 


DP 


167 


323 


PBMT 


186 


516 


Summ 


839 


1425 


QA 


238 


202 


TE 


56 


44 



Table 1 : Clusters and their citation network size 



1.1 Related Work 

Although there has been work on analyzing ci- 
tation and collaboration networks ( |Teufel et al, 
2006 Newman, 200 1| | and scientific article sum- 
marization ( Teufel and Moens, 2002] ), to the 



knowledge of the author there is no previous work 
that study the text of the citation summaries to 



produce a summary. ( Bradshaw, 2003 ; Bradshaw, 
[2002j ) get benefit from citations to determine the 
content of articles and introduce "Reference Di- 
rected Indexing" to improve the results of a search 
engine. 

In other work, ( jNanba et al, 2004bt |Nanba et 



al, 2004a) analyze citation sentences and automat- 
ically categorize citations into three groups using 
160 pre-defined phrase -based rules. This catego- 
rization is then used to build a tool for survey gen- 



eration. (Nanba and Okumura, 1999 1 also discuss 



the same citation categorization to support a sys- 



tem for writing a survey. (Nanba and Okumura, 
1999} [Nanba et al, 2004b[ ) report that co-citation 



implies similarity by showing that the textual simi- 
larity of co-cited papers is proportional to the prox- 
imity of their citations in the citing article. 

Previous work has shown the importance of 
the citation summaries in understanding what a 
paper says. The citation summary of an article 
A is the set of sentences in other articles which 
cite A. ( |Elkiss et al., 200 81) performed a large- 
scale study on citation summaries and their impor- 
tance. They conducted several experiments on a 
set of 2,497 articles from the free PubMed Cen- 
tral (PMC) repositorjQ Results from this exper- 
iment confirmed that the "Self Cohesion" dElkiss 



et al., 20081 of a citation summary of an arti- 



cle is consistently higher than the that of its ab- 



stract. (Elkiss et al., 2008 1 also conclude that ci- 



tation summaries are more focused than abstracts, 
and that they contain additional information that 



does not appear in abstracts. (Kupiec et al., 1995 1 
use the abstracts of scientific articles as a target 
summary, where they use 188 Engineering Infor- 
mation summaries that are mostly indicative in na- 



ture. Abstracts tend to summarize the documents 
topics well, however, they don't include much use 
of metadata. (Kan et al, 2002) use annotated bib- 
liographies to cover certain aspects of summariza- 
tion and suggest guidelines that summaries should 
also include metadata and critical document fea- 
tures as well as the prominent content-based fea- 
tures. 

Siddharthan and Teufel describe a new task to 



decide the scientific attribution of an article ( |Sid- 
'dharthan and Teufel, 2007 1 and show high human 
agreement as well as an improvement in the per- 



formance of Argumentative Zoning ( Teufel, 2005 1. 
Argumentative Zoning is a rhetorical classification 
task, in which sentences are labeled as one of Own, 
Other, Background, Textual, Aim, Basis, Contrast 
according to their role in the author's argument. 
These all show the importance of citation sum- 
maries and the vast area for new work to analyze 
them to produce a summary for a given topic. 

2 Data 

The ACL Anthology is a collection of papers from 
the Computational Linguistics journal, and pro- 
ceedings from ACL conferences and workshops 
and includes almost 11,000 papers. To produce 
the ACL Anthology Network (AAN), ( [Joseph an"d 



' http://www.pubmedcentral.gov 



Radev, 2007) manually performed some prepro- 
cessing tasks including parsing references and 
building the network metadata, the citation, and 
the author collaboration networks. 

The full AAN includes all citation and collabo- 
ration data within the ACL papers, with the citation 
network consisting of 8, 898 nodes and 38, 765 di- 
rected edges. 

2.1 Clusters 

We built our corpus by extracting small clusters 
from the AAN data. Each cluster includes pa- 
pers with a specific phrase in the title or con- 
tent. We used a very simple approach to col- 
lect papers of a cluster, which are likely to be 
talking about the same topic. Each cluster con- 
sists of a set of articles, in which the topic 
phrase is matched within the title or the content 
of papers in AAN. In particular, the five clus- 
ters that we collected this way, are: Dependency 
Parsing (DP), Phrased Based Machine Translation 
(PBMT), Text Summarization (Summ), Question 
Answering (QA), and Textual Entailment (TE). 
Table [T] shows the number of articles and citations 
in each cluster. For the evaluation purpose we 





ACL-ID 


Title 


Year 


CS Size 




L9o-105a 


Ihree New Probabilistic Models Jr^or Dependency Parsing: An Exploration 


1996 


66 




P97-1003 


T"}irpp rrpnprativp T PYipaliypH IVToHpIq Pni" Stati^tipfil Parsirnr 

±111 ^11^1 £111 V 1 ^^jYl^llllZj^U IVl^U^lS 1. Wl O iLlllS ll^lll L lUlMIII^ 


1997 


55 


o 


P99-1065 


A Statistical Parser For Czech 


1999 


54 






Pseudo-Projective Dependency Parsing 


2005 


40 




rUj- iUlZ 


On-line Large-Margin Xraining Of Dependency Parsers 


2005 


71 




N03-1017 


Statistical Phrase-Based translation 


2003 


180 




VV \j3 \.iD\j 1 


An Pvfilnatinn pYPrpiQp Pnr WnrH Ahornnpnt 


2003 


14 


§ 

CQ 


TOzi Anno 


T^ri*=> A 1 1 fTnmf^nt Tp^Tnnlntf* Annronfri Tr» vtfiti ctir'nl A^mr'riiriP* T^vcincl ntir\n 
1 lie rtliglllllClIl ICiiiUIcllC rtUUlUtlUIl ID OLtlLlbliCcli iVltlUIlillC 1 i d.llaid,LHJIl 


2004 


50 


Oh 




lliiUiU VCiiiClILS 111 X^llldSC JJclSCU. O LdLla LlUd.1 iVltlLlllllC 1 1 clllSltlLUJll 


2004 


24 






A Hierarchical Phrase-Based IVIodel Por Statistical Machine Translation 


2005 


65 




A 1/1/11 


Sentence R.eduction Por Automatic lext Summarization 


2000 


19 




AAA 9094 


f^iit AnH Pnctp RjiqpH TpyJ SiiTnTnnri7ntinn 

U L /T.11U. ± U CISCU. itA-l O UiliillCll IZjClllDil 


2000 


20 
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V^CllllUlQ-DaftCQ OUllllllclll/itlllUll \Ji iVlUILipiC L^Ut-UlllCIlLa. OCIILCIIUC CAllaCllUll, ... 


zuuu 


J 1 




WUj-UjU) 


The Potential And Limitations Of Automatic Sentence Extraction For Summarization 


2003 


15 






A Question Answering System Supported By Intormation Lxtraction 


2000 


13 




WOO-0603 


A Rule-Based Question Answering System For Reading Comprehension Tests 


2002 


19 


< 

a 


P02-1006 


Learning Surface Text Patterns For A Question Answering System 


2002 


74 




D03-1017 


Towards Answering Opinion Questions: Separating Facts From Opinions ... 


2003 


42 




P()3-I0{)1 


Offline Strategies For Online Question Answering: Answering Questions Before They Are Asked 


2003 


27 




D04-9907 


Scaling Web-Based Acquisition Ot Lntailment Relations 


2004 


12 




H05-1047 


A Semantic Approach To Recognizing Textual Entailment 


2005 


8 


u 

H 


H05-1079 


Recognising Textual Entailment With Logical Inference 


2005 


9 




W05-1203 


Measuring The Semantic Similarity Of Texts 


2005 


17 




P05-1014 


The Distributional Inclusion Hypotheses And Lexical Entailment 


2005 


10 



Table 2: Papers chosen from clusters for evaluation, with their publication year, and citation summary 
size 



chose five articles from each cluster. Table|2]shows 
the title, year, and citation summary size for the 5 
papers chosen from each cluster. The citation sum- 
mary size of a paper is the size of the set of citation 
sentences that cite that paper. 

3 Analysis 

3.1 Fact Distribution 

We started with an annotation task on 25 papers, 
listed in Table |2] and asked a number of annota- 
tors to read the citation summary of each paper 
and extract a list of the main contributions of that 
paper. Each item on the list is a non-overlapping 
contribution (fact) perceived by reading the cita- 
tion summary. We asked the annotators to focus 
on the citation summary to do the task and not on 
their background on this topic. 

As our next step we manually created the union 
of the shared and similar facts by different anno- 
tators to make a list of facts for each paper. This 
fact list made it possible to review all sentences in 
the citation summary to see which facts each sen- 
tence contained. There were also some unshared 
facts, facts that only appear in one annotator's re- 
sult, which we ignored for this paper. 

Table [3] shows the shared and unshared facts ex- 
tracted by four annotators for P99-1065. 

The manual annotation of P99-1065 shows that 
the fact "Czech DP" appears in 10 sentences out 
of all 54 sentences in the citation summary. This 
shows the importance of this fact, and that "Depen- 





Fact 


Occurrences 




u 


"Czech DP" 


10 




h 


"lexical rules" 


6 


tu 


h 


"POS/ tag classification" 


6 


S3 
J3 


h 


"constituency parsing" 


5 


on 


h 


"Punctuation" 


2 




h 


"Reordering Technique" 


2 




fr 


"Flat Rules" 


2 


Unshared 


"Dependency conversion" 
"80% UAS" 
"97.0% F-measure" 
"Generative model" 
'Relabel coordinated phrases" 
'Projective trees" 
"Markovization" 



Table 3: Facts of P99-1065 



dency Parsing of Czech" is one of the main contri- 
butions of this paper. Table [3] also shows the num- 
ber of times each shared fact appears in P99-1065's 
citation summary sorted by occurrence. 

After scanning through all sentences in the ci- 
tation summary, we can come up with a fact dis- 
tribution matrix for an article. The rows of this 
matrix represent sentences in the citation summary 
and the columns show facts. A 1 value in this ma- 
trix means that the sentence covers the fact. The 
matrix D shows the fact distribution of P99-1065. 
IDs in each row show the citing article's ACL ID, 
and the sentence number in the citation summary. 
These matrices, created using annotations, are par- 
ticularly useful in the evaluation process. 
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1 








W05-1505:7 
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1 











W05-1505:8 
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W05-1518:54 
























3.2 Similarity Measures 

We want to build a network with citing sentences 
as nodes and similarities of two sentences as edge 
weights. We'd like this network to have a nice 
community structure, whereby each cluster corre- 
sponds to a fact. So, a similarity measure in which 
we are interested is the one which results in high 
values for pairs of sentences that cover the same 
facts. On the other hand, it should return a low 
value for pairs that do not share a common contri- 
bution of the target article. 

The following shows two sample sentences from 
P99-1065 that cover the same fact and we want the 
chosen similarity measure to return a high value 
for them: 

So, Collins et al (1999) proposed a tag classification for 
parsing the Czech treebank. 

The Czech parser of Collins et al (1999) was run on a dif- 
ferent data set.. . 
Conversely, we'd like the similarity of the two fol- 
lowing sentences that cover no shared facts, to be 
quite low: 

Collins ( 1999) explicitly added features to his parser to im- 
prove punctuation dependency parsing accuracy. 

The trees are then transformed into Penn Treebank 
style constituencies- using the technique described in 
(Collins et al, 1999). 
We used P99-1065 as the training sample, on 
which similarity metrics were trained, and left the 
others for the test. To evaluate a similarity mea- 
sure for our purpose we use a simple approach. For 
each measure, we sorted the similarity values of all 
pairs of sentences in P99-1065's citation summary 
in a descending order. Then we simply counted the 
number of pairs that cover the same fact (out of 172 
such fact sharing pairs) in the top 100, 200 and 300 
highly similar ones out of total 2, 862 pairs. Table 
[4] shows the number of fact sharing pairs among 
the top highest similar pairs. Table [4] shows how 
cosine similarity that uses a tf-idf measure outper- 
forms the others. We tried three different poli- 
cies for computing IDF values to compute cosine 



Measure 


Top 100 


Top 200 


Top 300 


tf-idf (General) 


34 


66 


74 


tf-idf (AAN) 


34 


56 


74 


LCSS 


26 


37 


54 


tf 


24 


34 


46 


tf2gen 


13 


26 


35 


tf-idf (DP) 


16 


26 


28 


Levenshtein 


2 


9 


16 



Table 4: Different similarity measures and their 
performances in favoring fact-sharing sentences. 
Each column shows the number of fact-sharing 
pairs among top highly similar pairs. 



similarity: a general IDF, an AAN-specific IDF 
where IDF values are calculated only using the 
documents of AAN, and finally DP-specific IDF 
in which we only used all-DP data set. Table [4] 
also shows the results for an asymmetric similarity 
measure, generation probability ( Erkan, 2006| ) as 
well as two string edit distances: the longest com- 



mon substring and the Levenshtein distance (Lev- 
[enshtein, 1966 ). 

4 Methodology 

In this section we discuss our graph clustering 
method for article summarization, as well as other 
baseline methods used for comparisons. 

4.1 Network-Based Clustering 

The Citation Summary Network of an article A is 
a network in which each sentence in the citation 
summary of A is a node. This network is a com- 
plete undirected weighted graph where the weight 
of an edge between two nodes shows the similarity 
of the two corresponding sentences of those nodes. 
The similarity that we use, as described in sec- 
tion |3]2] is such that sentences with the same facts 
have high similarity values. In other words, strong 
edges in the citation summary network are likely 
to indicate a shared fact between two sentences. 
A graph clustering method tries to cluster the 
nodes of a graph in a way that the average intra- 
cluster similarity is maximum and the average 
inter-cluster similarity is minimum. To find the 
communities in the citation summary network we 



use (Clauset et al., 2004 1, a hierarchical agglom- 
eration algorithm which works by greedily opti- 
mizing the modularity in a linear running time for 
sparse graphs. 

To evaluate how well the clustering method works, 
we calculated the purity for the clusters found of 
each paper. Purity ([Manning et al., 2008[) is a 



method in which each cluster is assigned to the 
class with the majority vote in the cluster, and then 





ACL-ID 


#Facts |C| 


#CIiisters |a| 


Purity{n, C) 




C%-1058 


4 


4 


0.636 




P97-l()03 


5 


5 


0.750 


Q 


P99-1065 


7 


7 


0.724 




P05-1013 


5 


3 


0.689 




P05-l()12 


7 


5 


0.500 




N03-1017 


8 


4 


0.464 


H 


W03-0301 


3 


3 


0.777 


s 

CQ 


J04-4002 


5 


5 


0.807 




N04-1033 


5 


4 


0.615 




P05-l()33 


6 


5 


0.650 




AOO-1043 


5 


4 


0.812 


e 


AOO-2024 


5 


2 


0.333 


e 

3 


COO- 1072 


3 


4 


0.857 




WOO-0403 


6 


4 


0.682 




W03-05I0 


4 


3 


0.727 




AOO-1023 


3 


2 


0.833 




WOO-0603 


7 


4 


0.692 


< 

a 


P02-1006 


7 


5 


0.590 


D03-1017 


7 


4 


0.500 




P03-1001 


6 


4 


0.500 




D04-9907 


7 


3 


0.545 


M 
H 


H05-1047 


4 


3 


0.833 


H05-1079 


5 


3 


0.625 




W05-I203 


3 


3 


0.583 




P05-1014 


4 


2 


0.667 



Table 5: Number of real facts, clusters and purity 
for each evaluated article 



the accuracy of this assignment is measured by di- 
viding the number of correctly assigned documents 
by A^. More formally: 

purity(0, — ^ '^J I 

k ^ 



where Q = {wi, • • • ,^k} is the set of clus- 
ters and C = {ci, C2, . . . , cj} is the set of classes. 
(jjk is interpreted as the set of documents in and 
Cj as the set of documents in cj. For each evalu- 
ated article, Table |5] shows the number of real facts 
(|C| = J), the number of clusters = K) and 
purity{Q., C) for each evaluated article. Figure [T] 
shows the clustering result for J04-4002, in which 
each color (number) shows a real fact, while the 
boundaries and capital labels show the clustering 
result. 

4.1.1 Sentence Extraction 

Once the graph is clustered and communities are 

formed, to build a summary we extract sentences 
from the clusters. We tried these two different sim- 
ple methods: 

• Cluster Round-Robin (C-RR) 
We start with the largest cluster, and extract 
sentences in the order they appear in each 
cluster. So we extract first, the first sentences 
from each cluster, then the second ones, and 
so on, until we reach the summary length 
limit Previously, we mentioned that facts 
with higher weights appear in greater num- 
ber of sentences, and clustering aims to clus- 
ter such fact-sharing sentences in the same 




Figure 1: Each node is a sentence in the citation 
summary for paper J04-4002. Colors (numbers) 
represent facts and boundaries show the clustering 
result 

communities. Thus, starting with the largest 
community is important to ensure that the 
system summary first covers the facts that 
have higher frequencies and therefore higher 
weights. 

• Cluster Lexrank (C-lexrank) 
The second method we used was Lexrank 



( Erkan and Radev, 2004 ) inside each cluster 



In other words, for each cluster ilj we made a 
lexical network of the sentences in that clus- 
ter (Ni) . Using Lexrank we can find the 
most central sentences in A'^; as salient sen- 
tences of to include in the main summary. 
We simply choose, for each cluster 17 j, the 
most salient sentence of 17^, and if we have 
not reached the summary length limit, we do 
that for the second most salient sentences of 
each cluster, and so on. The way of ordering 
clusters is again by decreasing size. 
Table [6] shows the two system summaries gen- 
erated with C-RR and C-lexrank methods for P99- 
1065. The sentences in the table appear as they 
were extracted automatically from the text files of 
papers, containing sentence fragments and malfor- 
mations occurring while doing the automatic seg- 
mentation. 

4.2 Baseline Methods 

We also conducted experiments with two baseline 
approaches. To produce the citation summary we 
used Mead's ( |Radev et al., 2004[ ) Random Sum- 



mary and Lexrank (Erkan and Radev, 2004 1 on 
the entire citation summary network as baseline 
techniques. Lexrank is proved to work well in 
multi-document summarization ( [Erkan and Radev, 
2004| ). It first builds a lexical network, in which 



ID 


Sentence 


C-RR 


W()5-15()5:9 

W04-2407:27 

J03-4003:33 

H05-1066:51 

E06-1011:21 


3 Constituency Parsing for Dependency Trees A pragmatic justification for using constituency- based pai'ser in order 

to predict dependency struc- tures is that currently the best Czech dependency- tree parser is a constituency-based parser (Colfins et al, 1999; Zeman, 2004). 
However, since most previous studies instead use the mean attachment score per word (Eisner, 1996; Collins et al, 1999), we will give this measure as well. 
3 We find lexical heads in Penn Treebank data using the rules described in Appendix A of Collins (1999). 
Furthermore, we can also see that the MST parsers perform favorably compared to the more powerful 

lexicalized phr ase- structure parsers, such as those presented by Collins et al (1999) and Zeman (2004) that use expensive 0(n5) parsing al- gorithms. 

5.2 Czech Results For the Czech data, we used the predefined train- ing, development and testing split 

of the Prague Dependency Treebank (Hajic et al, 2001), and the automatically generated POS tags supplied with the data, 

which we reduce to the POS tag set from Collins et al (1999). 


C-Lexrank 


P05-1012:16 
W04-2407:26 

W05-1505:14 

P06-l()33:31 

P05-1012:17 


The Czech parser of Collins et al (1999) was lun on a different data set and most other dependency parsers are evaluated using English. 

More precisely, parsing accuracy is measured by the attachment score, which is 

a standard measure used in studies of dependency parsing (Eisner, 1996; Collins et al, 1999). 

In an attempt to extend a constituency-based pars- ing model to train on dependency trees, 

Collins transforms the PDT dependency trees into con- stituency trees (Collins et al, 1999). 

More specifi- cally for PDT, Collins et al (1999) relabel coordi- nated phrases after converting dependency struc- tures to phrase 
structures, and Zeman (2004) uses a kind of pattern matching, based on frequencies of the parts- of- speech of conjuncts and conjunc- tions. 
In par- ticular, we used the method of Collins et al (1999) to simplify part-of-speech tags since 
the rich tags used by Czech would have led to a large but rarely seen set of POS features. 



Table 6: System Summaries for P99-1065. (a) Using C-RR, (b) using C-Lexrank with length of 5 
sentences 



nodes are sentences and a weighted edge between 
two nodes shows the lexical similarity. Once this 
network is built, Lexrank performs a random walk 
to find the most central nodes in the graph and re- 
ports them as summary sentences. 

5 Experimental Setup 

5.1 Evaluation Method 

Fact-based evaluation systems have been used in 



several past projects (Lin and Demner-Fushman,| 



2006 



Marton and Radul, 2006[ ), especially in 
the TREC question answering track. ( |Lin and 



Demner-Fushman, 20061 use stemmed unigram 
similarity of responses with nugget descriptions to 



produce the evaluation results, whereas (Marton 



and Radul, 2006 1 uses both human judgments and 
human descriptions to evaluate a response. 

An ideal summary in our model is one that cov- 
ers more facts and more important facts. Our def- 
inition for the properties of a "good" summary of 
a paper is one that is relatively short and consists 
of the main contributions of that paper. From this 
viewpoint, there are two criteria for our evaluation 
metric. First, summaries that contain more impor- 
tant facts are preferred over summaries that cover 
fewer relevant facts. Second, facts should not be 
equally weighted in this model, as some of them 
may show more important contributions of a pa- 
per, while others may not. 

To evaluate our system, we use the pyra- 



mid evaluation method (Nenkova and Passonneau, 



2004 1 at sentence level. Each fact in the citation 



summary of a paper is a summarization content 



unit (SCU) ( Nenkova and Passonneau, 2004 1, and 



the fact distribution matrix, created by annotation, 
provides the information about the importance of 
each fact in the citation summary. 

The score given by the pyramid method for a 
summary is a ratio of the sum of the weights of 
its facts to the sum of the weights of an optimal 
summary. This score ranges from to 1 , and high 
scores show the summary content contain more 
heavily weighted facts. We believe that if a fact 
appears in more sentences of the citation summary 
than another fact, it is more important, and thus 
should be assigned a higher weight. To weight the 
facts we build a pyramid, and each fact falls in a 
tier. Each tier shows the number of sentences a fact 
appears in. Thus, the number of tiers in the pyra- 
mid is equal to the citation summary size. If a fact 
appears in more sentences, it falls in a higher tier. 
So, if the fact /j appears times in the citation 
summary it is assigned to the tier T\f^\■ 

The pyramid score formula that we use is com- 
puted as follows. Suppose the pyramid has n tiers, 
Tj, where tier T„ on top and Ti on the bottom. The 
weight of the facts in tier Tj will be i (i.e. they ap- 
peared in i sentences). If |Ti| denotes the number 
of facts in tier Ti, and Di is the number of facts in 
the summary that appear in Ti, then the total fact 
weight for the summary is D = Y^^^^ i x Di. Ad- 
ditionally, the optimal pyramid score for a sum- 
mary with X facts, is 

Max = Er=,+i ^xi^*i+jx(^-Er=,+i l^^l) 

where J = maxj(^)Lj \Tt\ > X). Subsequently, 
the pyramid score for a summary is calculated as 

P=-D- 

Max ■ 



5.2 Results and Discussion 

Based on the described evaluation method we con- 
ducted a number of experiments to evaluate dif- 
ferent summaries of a given length. In particular, 
we use a gold standard and a random summary to 
determine how good a system summary is. The 
gold standard is a summary of a given length that 
covers as many highly weighted facts as possible. 
To make a gold summary we start picking sen- 
tences that cover new and highly weighted facts, 
until the summary length limit is reached. On the 
other hand, in the random summary sentences are 
extracted from the citation summary in a random 
manner. We expect a good system summary to be 
closer to the gold than it is to the random one. 

Table [7] shows the value of pyramid score P, for 
the experiments on the set of 25 papers. A P score 
of less than 1 for a gold shows that there are more 
facts than can be covered with a set of l^l sen- 
tences. 

This table suggests that C-lexrank has a higher 
average score, P, for the set of evaluated articles 
comparing C-RR and Lexrank. 

As mentioned earlier in section |4.1.1[ once the 
citation summary network is clustered in the C-RR 
method, the sentences from each cluster are chosen 
in a round robin fashion, which will not guarantee 
that a fact-bearing sentence is chosen. 

This is because all sentences, whether they 
cover any facts or not, are assigned to some clus- 
ter anyway and such sentences might appear as the 
first sentence in a cluster. This will sometimes re- 
sult in a low P score, for which P05-1012 is a good 
example. 

6 Conclusion and Future Work 

In this work we use the citation summaries to un- 
derstand the main contributions of articles. The 
citation summary size, in our experiments, ranges 
from a few sentences to a few hundred, of which 
we pick merely a few (5 in our experiments) most 
important ones. 

As a method of summarizing a scientific paper, 
we propose a clustering approach where commu- 
nities in the citation summary's lexical network 
are formed and sentences are extracted from sep- 
arate clusters. Our experiments show how our 
clustering method outperforms one of the cur- 
rent state-of-art multi-document summarizing al- 
gorithms, Lexrank, on this particular problem. 

A future improvement will be to use a reorder- 
ing approach like Maximal Marginal Relevance 





Article 


Gold 


Mead's Random 


Lexrank 


C-RR 


C-Iexiank 




C96-1058 


1.00 


0.27 


0.73 


0.73 


0.73 




P97-1003 


1.00 


0.08 


0.40 


0.60 


0.40 


Oh 

a 


P99-1065 


0.94 


0.30 


0.54 


0.82 


0.67 




P05-1013 


1.00 


0.15 


0.69 


0.97 


0.67 




P05-1012 


0.95 


0.14 


0.57 


0.26 


0.62 




N03-1017 


0.96 


0.26 


0.36 


0.61 


0.64 


H 


W03-0301 


1.00 


0.60 


1.00 


1.00 


1.00 


CQ 


J04-4002 


1.00 


0.33 


0.70 


0.48 


0.48 


Oh 


N04-1033 


1.00 


0.38 


0.38 


0.31 


0.85 




P05-1033 


1.00 


0.37 


0.77 


0.77 


0.85 




AOO-1043 


1.00 


0.66 


0.95 


0.71 


0.95 


s 


AOO-2024 


1.00 


0.26 


0.86 


0.73 


0.60 


s 

3 


COO- 1072 


1.00 


0.85 


0.85 


0.93 


0.93 




WOO-0403 


1.00 


0.55 


0.81 


0.41 


0.70 




W03-0510 


1.00 


0.58 


1.00 


0.83 


0.83 




AOO-1023 


1.00 


0.57 


0.86 


0.86 


0.86 




WOO-0603 


1.00 


0.33 


0.53 


0.53 


0.60 


< 

a 


P02-1006 


1.00 


0.49 


0.92 


0.49 


0.87 


D03-1017 


1.00 


0.00 


0.53 


0.26 


0.85 




P03-1001 


1.00 


0.12 


0.29 


0.59 


0.59 




D04-9907 


1.00 


0.53 


0.88 


0.65 


0.94 


m 

H 


H05-1047 


1.00 


0.83 


0.66 


0.83 


1.00 


H05-1079 


1.00 


0.67 


0.78 


0.89 


0.56 




W05-1203 


1.00 


0.50 


0.71 


1.00 


0.71 




P05-1014 


1.00 


0.44 


1. 00 


0.89 


0.78 




Mean 


0.99 


0.41 


0.71 


0.69 


0.75 



Table 7: Evaluation Results (IS"! = 5) 
(MMR) (ICarbonell and Goldstein, 199811 to re-rank 



clustered documents within each cluster in order 
to reduce the redundancy in a final summary. An- 
other possible approach is to assume the set of sen- 
tences in the citation summary as sentences talk- 
ing about the same event, yet generated in differ- 
ent sources. Then one can apply the method in- 
spired by ( Barzilay et al., 1999[ l to identify com- 
mon phrases across sentences and use language 
generation to form a more coherent summary. The 
ultimate goal, however, is to produce a topic sum- 
marizer system in which the query is a scientific 
topic and the output is a summary of all previous 
works in that topic, preferably sorted to preserve 
chronology and topicality. 
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